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Abstract: Voltage-gated channel (VKC) plays important roles in biology procession, 
especially in nervous system. Different subfamilies of VKCs have different biological 
fiinctions. Thus, knowing VKCs' subfamilies has become a meaningM job because it can 
guide the direction for the disease diagnosis and drug design. However, the traditional 
wet-experimental methods were costly and time-consuming. It is highly desirable to 
develop an effective and powerflil computational tool for identifying different subfamilies 
of VKCs. In this study, a predictor, called iVKC-OTC, has been developed by 
incorporating the optimized tripeptide composition (OTC) generated by feature selection 
technique into the general form of pseudo-amino acid composition to identify six 
subfamilies of VKCs. One of the remarkable advantages of introducing the optimized 
tripeptide composition is being able to avoid the notorious dimension disaster or over 
fitting problems in statistical predictions. It was observed on a benchmark dataset, by using 
a jackknife test, that the overall accuracy achieved by iVKC-OTC reaches to 96.77% in 
identifying the six subfamilies of VKCs, indicating that the new predictor is promising or 
at least may become a complementary tool to the existing methods in this area. It has not 
escaped our notice that the optimized tripeptide composition can also be used to investigate 
other protein classification problems. 
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1. Introduction 

Ion channels located in the surface of cell membrane can maintain the balance of cell 
microenvironment by selectively penetrating ions and organic molecules in and out of cells. The 
channel has been found in all living organisms [1]. The voltage-gated channel (VKC), which is the 
largest family of channels, specifically controls the movement of under the stimulation of 
voltage changes in the cell's membrane potential. During action potentials, they play crucial roles in 
returning the depolarized cell to a resting state [2]. They are also key components in generation and 
propagation of electrical impulses in nervous system. The mutations in VKC genes can lead to severe 
diseases, such as long QT syndrome and epilepsy [3]. Thus, VKCs have become valuable targets for 
disease diagnosis and drug design. 

VKCs have four subunits. Each subunit comprises six transmembrane helices. A re-entrant loop 
forms the ion-selective channel, highly variable C- and A^-terminal domains (Figure 1). According to 
the A^- and C-terminal domains, VKCs can be grouped into different subfamilies. The proteins in these 
subfamilies are flinctionally divergent. Different subfamilies of VKC proteins have different sensitivity 
to the membrane potential and response to changes in potential [2]. Therefore, recognition of 
subfamily type of a new VKC is benefit to understand its biological functions. However, the traditional 
biochemical methods were costly and time-consuming. Thus, it is necessary to develop effective 
computational methods to identify subfamilies of VKCs. 

Figure 1. Schematic representation of potassium (K^) channel subunit. The SI, S2, S3, S4, 
S5, S6 are six transmembrane helices. 



In the past decade, some scholars have focused on the identification of VKCs families. Liu et al. [4] 
proposed a dipeptide-based method to predict five subfamilies of VKCs. Subsequently, Chen and Lin [5] 
developed an SVM-based model to predict six subfamilies of VKCs by using the Correlation-based 
Feature Subset Selection algorithm (CFSS) to select the optimal features. All these methods could 
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yield quite encouraging results, and each of them did play a role in stimulating the development of this 
area. However, further work is needed due to the following reasons, (i) The predicted successful rate is 
still far from satisfaction; (ii) No web-server was provided to most of these methods, and, hence, their 
usage is quite limited, especially for the majority of experimental scientists. 

The present study was initiated in an attempt to improve the prediction of VKC subfamilies from 
the above two aspects. According to a comprehensive review [6], to establish a really useful statistical 
predictor for VKC subfamily prediction, an objective benchmark dataset was constructed. 
Subsequently, a feature selection technique was used to obtain the optimal tripeptides. The support 
vector machine was used to operate the prediction. The jackknife cross-validation test was utilized to 
estimate the accuracy of the predictor. Finally, we established a user-friendly web-server for the predictor. 

2. Results and Discussion 

2.1. Benchmark Dataset 

The raw dataset of VKCs were extracted from the updated Voltage-gated Channel Database 
(VKCDB) [2] and filtered by VKCPred [5]. The following steps were used to construct a reliable 
benchmark dataset. At tirst, if the primary structure (amino acid sequence) of a VKC contains 
ambiguous residues, such as "B", "X", and "Z", the VKC will be removed; Secondly, if the sequence 
is fragment of other proteins, it will be excluded because its information is redundant and fragmentary; 
Thirdly, to objectively evaluate the proposed predictor, the CD-HIT software [7] was used to remove 
highly similar sequences by setting the cutoff of sequence identity to 60%. As a result, we obtained the 
benchmark dataset S as formulated by: 

S = SiUS2US3US4US5US6 (1) 

where the subset Si contains 82 Kvl subfamily proteins, S2 contains 16 Kv2 subfamily proteins, 
S3 contains 37 Kv3 subfamily proteins, S4 contains 32 Kv4 subfamily proteins, S5 contains 10 Kv6 
subfamily proteins and S(, contains 40 Kv7 subfamily proteins (Table 1) and where U represents the 
sjmibol for union in the set theory. For readers' convenience, the 217 VKCs can be freely downloaded 
from our webserver. 

Table 1. Breakdown of the 217 voltage-gated K"^ channels (VKCs) in the benchmark 
dataset S according to their six subfamilies. 



Dataset 


Channel Subfamilies 


Number of VKC Samples 


Si 


Kvl 


82 


S2 


Kv2 


16 


S3 


Kv3 


37 


S4 


Kv4 


32 


Ss 


Kv6 


10 


Ss 


Kv7 


40 


S 


Overall 


217 
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2.2. The Tripeptide Composition 

To develop a sequence-based predictor for the prediction of the subfamilies of VKCs, one of the 
keys is to formulate its sequence with an effective mathematical expression that can truly reflect the 
intrinsic correlation with the types to be predicted. The most straightforward method to formulate the 
sample of a VKC protein P with L residues is to use its entire amino acid sequence, as can be 
formulated by: 

P = RiR2R3R4...R£ (2) 

where Ri represents the 1st residue of the protein P, R2 represents the 2nd residue of the protein P, and 
so forth. According to a recent review [8], the general form of PseAAC for a protein P is formulated by: 

P = [¥i^2...^«...^nf (3) 

where the subscript Q is an integer and its value, as well as the components ^u{u= 1,2, D), will 
depend on how to extract the desired information from the amino acid sequence P {cf. Equation (3)). 

Tripeptide is a useful and minimal biological recognition signal which can be used for studying 
molecular modulators of biological function [9] and predicting plausible structures for oligopeptides as 
well as de novo protein design [8]. Thus, we extract tripeptide composition from the benchmark 
dataset S to define the components in Equation (3) for the VKC samples concerned in this study. Then 
a VKC sequence can be formulated by: 

P~ [/ij/i'" ■ "'yj'' ■ "'/sooo] (4) 

where sjmibol T denotes the transposition of vector and the^ is the frequency of the z-th (z = 1, 2, . . ., 8000) 
tripeptide in the VKC and expressed as: 

fi=nJ{L-2) (5) 

where and L denote the occurrence number of the z-th tripeptide and the length of the VKC 
sequence, respectively. 

2.3. Feature Selection 

If all 8000 tripeptides are used for prediction, the predictive result isn't usually satisfactory, such as 
low generalization ability of prediction model and poor prediction results because irrelevant features 
and noise is included. On the other hand, it is time-consuming to analyze an 8000 dimensional vector 
for large amounts of proteins. Using feature selection techniques to optimize feature set can not only 
gain deeper insight into the intrinsic properties of VKCs, but also improve understandability, 
scalability, possibility, and accuracy of the proposed models. Moreover, it can also economize the time 
for model construction and prediction. 

Although many dimensionality reduction techniques such as principal component analysis 
(PCA) [10,11], diffusion Maps [12] and minimal-redundancy-maximal-relevance (mRMR) [13,14] 
have been proposed to perform feature selection, none of them concerned the statistical significance of 
the features. According to this, we proposed the binomial distribution to investigate the statistical 
significance of each tripeptide and the optimal the feature set. 
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Each of the 8000 tripeptides occurring in one subfamily may be a stochastic event, thus, we must 
calculate the confidence level (CL) of each tripeptide occurring in different VKC subfamilies. For a 
stochastic event, two possible cases that are occurrence and non-occurrence will happen when one 
observes the i-th tripeptide occurring in the A:-th VKC subfamily. Each outcome has a fixed probability 
when benchmark dataset has been fixed. This probability is called prior probability and defined as: 

1 8000 /■r— I 6 T— 1 8000 



Z8000 /V^" V^™"" 
M %/z..=i2.,=i % (6) 

where ^^.j^^^T"* denotes the total occurrence number of all tripeptides in the benchmark dataset. 
^8o»o^^^ is the occurrence number of all tripeptides in the A:-th VKC subfamily. The riik represents the 

number of the i-th tripeptides occurring in the A:-th VKC subfamily. Correspondingly, the probability 

of the non-occurrence in the A:-th VKC subfamily is defined as = 1 ~Pk- 

Let N. = y^ n. represents the total occurrence number of the z-th tripeptide in benchmark dataset. 

That is to say, under the condition of the prior probability pk, one performs trial or observation with Ni 
times. We may calculate the posterior probability P,vt of the i-th tripeptide occurring riik or more times 
in the A:-th VKC subfamily as following: 

^' N 

where CLik is the CL of the z-th tripeptide in the A:-th VKC subfamily. Based on small probability event 
principle, if Pit is a small value, it means the tripeptide / appearing in VKC subfamily k is not random. 

There are six VKC subfamilies in the current study, namely k = 1, 2, 3, 4, 5, 6. Hence, for an 
arbitrary tripeptide i, it has six CLs corresponding to six VKC subfamilies. Then, we may define the 
probability of tripeptide i in benchmark dataset as: 

CZ,. =inax{CZ,.Kvl,CZ,K^2.<^AKv3'<^AKv4.C'AKv6'C'AKv7} 0' = 1, 2, • • • , 8000) (8) 

It should be noted that the larger the CLi is, the more likely this feature has a better discriminative 
capability. Therefore, we ranked the tripeptides according to their CLi. Based on the ranked tripeptides, 
we used the Incremental Feature Selection (IFS) strategy to find an optimal subset of features that 
gives the highest overall accuracy. During the IFS procedure, the feature subset started with one 
feature with the largest CL. A new feature subset was composed when one feature with the second 
largest CL had been added. By adding features one by one from larger to smaller rank, this process 
repeated 8000 times until all the features were evaluated. Thus, the 8000 feature sets thus formed 
would be composed of 8000 ranked features. The r-th feature set can be formulated as: 

S. = {/i,/2, -U-J.) (l<r<8000) (9) 

where fi has been defined by Equation (5). For each of the feature sets, the cross-validation test was 
used to investigate the accuracy by using proposed predictive algorithm. Through the method referred 
above, we got an IFS curve in Descartes Curvilinear Coordinate System, which used x as X axis, CL as 
7 axis and overall accuracy as Z axis. The optimal feature set is expressed as: 

Se={/i'/2. ■■■fi,---,fe} (10) 
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with which the IFS curve reaches its peak. In other words, in the 3D Cartesian coordinate system, 
when X= @, the value of overall accuracy is the maximum. Thus, we used the 0 features to build the 
final predictor. 

2.4. Support Vector Machine 

Support vector machine has been widely applied in bioinformatics [15-20]. The basic idea of 
applying SVM to pattern classitication is to map samples with low dimensional feature space into a 
high dimensional space, and then seek an optimal separating hyperplane with the maximal margin in 
this space by using the decision function: 



where x, is the i-th training vector. The yi represents the type of the i-th training vector, a, is 

coefficient which can be solved by quadratic programming. The b is the intercept parameter. k{X,Xi) 

is a kernel function which defines an inner product in a high dimensional feature space. Because of its 
effectiveness and speed in nonlinear classification process, the radial basis kernel function (RBF) 



The traditional SVM was designed for two-class problems. For handling a multi-class problem, 
"one-versus-one (OVO)" and "one-versus-rest (OVR)" are often applied to extend the traditional 
SVM. The present study adopted OVO strategy for multi-class prediction. The software toolbox used 
to implement SVM is LibSVM [21]. A grid search method was used to optimize the regularization 
parameter c and kernel parameter y by using cross-validation test. The search spaces for c and y are 
(2^^, 2~^) and (2~^, 2~^^) with steps being 2~^ and 2, respectively. 

2.5. Prediction Assessment 

The predictive capability and reliable of the method is estimated by the four parameters: the 
sensitivity {Sn), specificity {Sp), Matthew's correlation coefficient (MCC) and overall accuracy (OA), 
which were employed to measure the performance of the method and can be defined as follows: 



f(h = sgnif^y^a, ■K(X,X,) + b) 



N 



(11) 



K(X.,Xj) = exp(-y \\ X-Xj ||^) was used to in this work. 



Sn,= 



TP,+FN, 



(12) 



(13) 



TN,+FP, 



MCC^ = 



TP,xTN,-FP,xFN, 



^{TP, +FN,){TN,+FP,){TP,+FP,){TN,+FN,) 



(14) 



(15) 



where k is the k-t\\ VKC subfamily, N is the total sequence number of benchmark dataset. TP^, TNt, 
FPk and FNk represent true positive, true negative, false positive and false negative of the A:-th VKC 
subfamily, respectively. 
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3. Experimental 

In statistical prediction, the following four cross-validation test methods were often used to build a 
predictor for its effectiveness in practical application: self-consistency test, independent dataset test, 
n-fold cross-validation and jackknife cross-validation. Among them, the jackknife test method makes 
best use of the data, involves no random sub-sampling and achieves unique results [6,22]. It has 
been widely and increasingly adopted in bioinformatics [5,12-14,23-25]. Therefore, the jackknife 
cross-validation was used in all procession of feature selection and parameter optimization of S VM. 

Based on Equations (4)-(5), we may define the 8000 tripeptide composition as the original feature 
set. Generally, the larger the feature set is, the more information the representation bears. However, 
the tripeptides with low CL (or large posterior probability) maybe randomly appear in six VKC 
subfamilies. Including these tripeptides into feature set will add redundant information or reduce the 
cluster-tolerant capacity so as to lower down the cross-validation accuracy. For example, 8000 tripeptides 
can only produce the overaU accuracy of 92.17% for predicting different VKC subfamilies. In contrast, 
the tripeptides with larger CL (or small posterior probability) give more reliable information for 
classification. The occurrence of these tripeptides prefers to different VKC subfamilies. However, if 
the number of tripeptide in feature set is very small, they are still not the optimized features for 
prediction because they cannot refiect real characteristics of VKCs and afford enough information, 
which deduces the poor predictive accuracy. For instance, by selecting 29 tripeptides with CL~100% 
(p value = 10"^), we can only achieve 81.1 0%. 

Therefore, it is a key step to obtain the best feature set which can product the maximum overall 
accuracy. According to the equation from Equation (6) to (9), we calculated the cross-validated accuracy 
of all 8000 feature sets using SVM and plotted a three-dimensional curve for CL, feature dimension 
and OA in Figure 2. As we can see from Figure 2, the overall accuracy reaches its maximum of 
96.77% when the CL is selected as 99.99%). The optimized feature set contains 648 tripeptides. Results 
in Table 2 show that the average Sn and average Sp are 93.92%) and 99.20%), respectively, indicating 
that the proposed method is indeed very powerful in identifying proteins which belongs to different 
subfamilies of VKCs. 

Recently, the optimized dipeptide composition (DPC) and amino acid composition (AAC) selected 
by Correlation-based Feature Subset Selection (CFSS) algorithm were used to predict six VKC 
subfamilies by Chen and Lin [5]. In jackknife cross-validation, the overall accuracies of 93.09%, 
85.71%) and 82.03%) were obtained by SVM, Naive Bayes and Random Forest, respectively. The 
comparative results in Table 2 demonstrate that the method proposed in this paper is superior to the 
published methods [5]. 
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Figure 2. The IFS curve (red) in a 3D Cartesian coordinate system for predicting six 
subfamilies of VKCs. The blue, green and yellow lines are the projections of the IFS curve 
on the Overall accuracy/Confidence level plane, the Overall accuracy/Feature dimension 
plane, the Feature dimension/Confidence level plane, respectively. 




Table 2, Comparison with other published methods. 



Family 


This Paper 


SVM [5] 


Naive Bayes [5] 


Random Forest [5] 


Sn (%) 


Sp (%) 


MCC 


Sn (%) 


Sp (%) 


MCC 


Sn (%) 


Sp (%) 


MCC 


Sn (%) 


Sp (%) 


MCC 


Kvl 


100.00 


96.30 


0.95 


93.90 


93.98 


0.86 


93.90 


83.85 


0.76 


97.56 


78.51 


0.76 


Kv2 


93.75 


100.00 


0.96 


87.50 


98.95 


0.86 


81.25 


100.00 


0.89 


75.00 


98.78 


0.82 


Kv3 


97.30 


98.89 


0.95 


89.19 


97.69 


0.87 


81.08 


95.12 


0.75 


59.45 


97.44 


0.67 


Kv4 


100.00 


100.00 


1.00 


93.75 


100.00 


0.96 


87.50 


100.00 


0.92 


65.38 


98.73 


0.75 


Kv6 


80.00 


100.00 


0.89 


100.00 


100.00 


1.00 


40.00 


100.00 


0.62 


80.00 


98.82 


0.87 


Kv7 


92.50 


100.00 


0.95 


95.00 


99.39 


0.95 


85.00 


98.70 


0.87 


85.00 


99.29 


0.89 


Average 
Sn (%) 


93.92 


93.22 


78.12 


77.07 


Average 
Sp (%) 


99.20 


98.34 


96.28 


95.26 


OA (%) 


96.77 


93.09 


85.71 


82.03 



For verifying the advantage of optimized tripeptide composition, it is necessary to investigate the 
performance of other parameters. Hence, we estimated the accuracies of traditional pseudo amino acid 
composition (PseAAC) [6], optimal tripep tides combined with PseAAC and optimal tripeptides 
combined with dipep tides on six subfamilies of voltage-gated ion channels. Results were recorded in 
Table 3. It is obviously that the optimized tripeptide composition is superior to other parameters. It 
should be noted that the two mixture features can only achieve the overall accuracies of 96.31% and 
95.39% which are lower than that (96.77%)) of our optimal tripeptides, suggesting that information 
redundancy or noise were included in mixture feature sets. 

For testifying the capability of the proposed feature selection technique, a powerful feature selection 
technique, namely SVM-RFE [26,27], was introduced to optimize the tripeptides. Subsequently, the 
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IFS strategy was used to find an optimal subset of features that gives the highest overall accuracy. The 
maximum accuracy was recorded in Table 3. Comparison demonstrated that our feature selection 
technique is more powerful. 



Table 3. Comparison with different methods on training set. 



Method 


Kvl 


Kv2 


Sn (%) 
Kv3 Kv4 


Kv6 


Kv7 


OA (%) 


Optimal tripeptides (Our method) 


100.00 


93.75 


97.30 


100.00 


80.00 


92.50 


96.77 


Optimal tripeptides (SVM-RFE) 


100.00 


81.25 


91.67 


96.88 


80.00 


87.55 


93.09 


Traditional PseAAC 


82.93 


81.25 


72.97 


78.13 


80.00 


87.50 


81.11 


Optimal tripeptides (Our method) + PseAAC 


100.00 


87.50 


97.30 


100.00 


80.00 


92.50 


96.31 


Optimal tripeptides (Our method) + Dipeptides 


100.00 


81.25 


94.59 


100.00 


80.00 


92.50 


95.39 



4. Conclusions 

In this work, we developed a promising feature selection technique to optimize feature set and 
applied these selected features to identify six VKC subfamilies. An overall accuracy of 96.77% was 
achieved, demonstrating that the proposed model is a powerful tool for the study of VKC subfamilies 
prediction. For the convenience of experimental scientists, a free web server iVKC-OTC was built to 
implement the prediction. A friendly guide was given to describe the way to use the iVKC-OTC web 
server. We believe that the predictor will be helpfiil for wet lab scientists who focus on VKC research. 
We hope the predictor will pave the way for the fiiture research of VKC. 

5. Web-Server and User Guide 

Establishing a user- friendly web-server will improve the efficiency and avoid repeating a complicated 
mathematics and program for studying VKC. The predictor established via aforementioned procedures is 
called iVKC-OTC, where "i" stands for "identify", "VKC" for "Voltage-gated K+ channel" and "OTC" 
for "optimized fripeptide composition". 

For the convenience of the vast majority of experimental scientists, we provided a guide on how to 
use the web-server to get the desired results. 

Step 1. Open the web server and you will see the top page of iVKC-OTC [28] on your computer 
screen, as shown in Figure 3 Click on the Read Me button to see a brief infroduction about the 
predictor and the caveat when using it. 

Step 2. Either type or copy/paste the query peptide sequences into the input box at the center of 
Figure 3 The input sequence should be in the FASTA format. A sequence in FASTA format consists of 
a single initial line beginning with a greater-than symbol (">") in the first column, followed by lines of 
sequence data. The words right after the symbol in the single initial line are optional and only used 
for the purpose of identification and description. The sequence ends if another line starting with a 
appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen 
by clicking on the Example button right above the input box. 
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Figure 3. A semi-screenshot for the top page of the iVKC-OTC. 




Step 3. CUck on the Submit button to see the predicted resuh. After choking the Submit button, you 
will see the following shown on the screen of your computer: the outcome for the 1st query example is 
"Kvl subfamily protein"; the outcome for the 2nd query sample is "Kv2 subfamily protein"; the 
outcome for the 3rd query sample is "Kv3 subfamily protein"; the outcome for the 4th query sample is 
"Kv4 subfamily protein"; the outcome for the 5th query sample is "Kv6 subfamily protein" and the 
outcome for the 6th query sample is "Kv7 subfamily protein". All these results are fully consistent 
with the experimental observations. It takes about few seconds for the above computation before the 
predicted result appears on your computer screen; the more number of query sequences and longer of 
each sequence, the more time it is usually needed. 

Step 4. Click on the Data button to download the benchmark datasets used to train and test the 
iVKC-OTC predictor. 

Step 5. Click on the Citation button to find the relevant papers that document the detailed 
development and algorithm of iVKC-OTC. 

Caveats. Each of the input query sequences cannot any illegal character: such as "B", "X", "U", "Z". 
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